Abstract


We considered the fully overlapping triplets of nucleotide bases and proposed a 2D graphical representation of protein sequences consisting of 20 amino acids and a stop code. Based on this 2D graphical representation, we outlined a new approach to analyze the phylogenetic relationships of coronaviruses by constructing a covariance matrix. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves.
© 2006 Elsevier B.V. All rights reserved. 


1. Introduction 


Compilation of DNA primary sequence data continues unabated and tends to overwhelm us with voluminous outputs that increase daily. Comparison of primary sequences of different DNA strands remains one of the important aspects of the analysis of DNA data banks. Mathematical analysis of the large volume of genomic DNA sequence data is one of the challenges for bio-scientists. There are three classes of methods for the analysis of DNA sequences: (i) Alignment [1,2]. (ii) Matrices: (1) matrices in which an individual entry corresponds to an individual pair of bases [3,6,7] and (2) matrices in which entries summarize information on different X-Y pairs of bases [4,5,7]. (iii) Graphical representation: graphical representation of DNA sequences provides a simple way of viewing, sorting and comparing various gene structures. Graphical techniques have emerged as a very powerful tool for the visualization and analysis of long DNA sequences. These techniques provide useful insights into local and global characteristics and the occurrences, variations and repetition of the nucleotides along a sequence, which are not as easily obtainable by other methods. In recent years several authors outlined different graphical representations of DNA sequences based on 2D, 3D or 4D [8-20]. Based on these graphical representations, several authors outlined approaches to compare DNA sequences [21-25].

All these methods are based on the four-letter alphabet (A, C, G, and T, standing for the nucleotide bases adenine, cytosine, guanine, and thymine, respectively). We instead consider the fully overlapping triplets of nucleotide bases. Considering triplets of nucleotide bases instead of individual nucleotide bases has three advantages: (i) The genetic code consists of triplets (codons) of DNA (or RNA in some viruses) nucleotides. (ii) One can easily find the open reading frame as the longest sequence of triplets that contains no stop codons when read in a single reading frame. (iii) The computation becomes simpler.

In this Letter, we proposed a 2D graphical representa- 
tion of the protein sequences consisting of 20 amino acids 
and a stop code. Based on this 2D graphical representation, 
we outlined a new approach to analyze the phylogenetic 
relationships of coronaviruses. The evolutionary distances 
are obtained through measuring the differences among 
the two-dimensional curves. Unlike most existing phylog- 
eny construction methods [26-31], the proposed method 
does not require multiple alignment.


2. 2D graphical representation of protein sequences and properties


As is known, all of the 64 triplets of nucleotide bases correspond to 20 amino acids and a stop code. There are three reading frames, starting at positions 1, 2 and 3, respectively. Using the translate tool, we can obtain three protein sequences consisting of 20 amino acids and a stop code. The 20 amino acids found in proteins can be grouped according to the chemistry of their R groups as in [32]: amino acids A, V, F, P, M, I, L belong to the hydrophobic chemical group; amino acids D, E, K, R belong to the charged chemical group; amino acids S, T, Y, H, C, N, Q, W belong to the polar chemical group; amino acid G belongs to the glycine chemical group. Any DNA sequence is then transformed into three new sequences defined over the alphabet {H, C, P, G}. The rule is as follows:


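$$
\begin{cases}
H & \text{if } g(3i-2,\,3i-1,\,3i) = A, V, F, P, M, I, L \\
C & \text{if } g(3i-2,\,3i-1,\,3i) = D, E, K, R \\
P & \text{if } g(3i-2,\,3i-1,\,3i) = S, T, Y, H, C, N, Q, W \\
G & \text{if } g(3i-2,\,3i-1,\,3i) = G, -
\end{cases}
$$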
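The rule above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the standard genetic code is assumed, the stop code is written as `*` rather than `-`, the input is assumed to contain only A, C, G and T (or U), and the names `translate`, `to_groups` and `three_group_sequences` are hypothetical helpers, not the translate tool mentioned in the text.

```python
# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Chemical groups as listed in the text; "*" stands for the stop code.
GROUPS = {"H": "AVFPMIL", "C": "DEKR", "P": "STYHCNQW", "G": "G*"}
GROUP_OF = {res: letter for letter, residues in GROUPS.items() for res in residues}


def translate(dna, frame):
    """Translate one reading frame (frame = 0, 1 or 2) into one-letter residues."""
    dna = dna.upper().replace("U", "T")
    # Only complete triplets are translated; trailing bases are ignored.
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


def to_groups(protein):
    """Map each residue to its group letter in {H, C, P, G}."""
    return "".join(GROUP_OF[res] for res in protein)


def three_group_sequences(dna):
    """Return the three {H, C, P, G} sequences, one per reading frame."""
    return [to_groups(translate(dna, frame)) for frame in range(3)]
```

For instance, `three_group_sequences("ATGGGCTAA")` translates the frames to `MG*`, `WA` and `GL` before mapping them to group letters.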


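As shown in Fig. 1, we construct a pyrimidine-purine graph on two quadrants of the Cartesian coordinate system, with pyrimidines (P and C) in the first quadrant and purines (H and G) in the fourth quadrant. The unit vectors representing the four letters H, G, C and P are as follows:

$$
(m, -\sqrt{n}) \to H, \quad (\sqrt{n}, -m) \to G, \quad (\sqrt{n}, m) \to C, \quad (m, \sqrt{n}) \to P,
$$

where $m$ is a real number with $m \neq \sqrt{n}$, and $n$ is a positive real number that is not a perfect square. In this way we reduce a DNA sequence to a series of nodes $P_0, P_1, P_2, \ldots, P_{\lfloor N/3 \rfloor}$, whose coordinates $x_i, y_i$ ($i = 0, 1, 2, \ldots, \lfloor N/3 \rfloor$, where $N$ is the length of the DNA sequence being studied) satisfy

$$
\begin{aligned}
x_i &= h_i m + g_i \sqrt{n} + c_i \sqrt{n} + p_i m, \\
y_i &= -h_i \sqrt{n} - g_i m + c_i m + p_i \sqrt{n},
\end{aligned}
\tag{1}
$$

where $h_i$, $c_i$, $g_i$ and $p_i$ satisfy

$$
\begin{aligned}
h_i &= A_i + \sqrt{s_1}\, V_i + \sqrt{s_2}\, F_i + \sqrt{s_3}\, P_i + \sqrt{s_4}\, M_i + \sqrt{s_5}\, I_i + \sqrt{s_6}\, L_i, \\
c_i &= D_i + \sqrt{s_7}\, E_i + \sqrt{s_8}\, K_i + \sqrt{s_9}\, R_i, \\
g_i &= S_i + \sqrt{s_{10}}\, T_i + \sqrt{s_{11}}\, Y_i + \sqrt{s_{12}}\, H_i + \sqrt{s_{13}}\, C_i + \sqrt{s_{14}}\, N_i + \sqrt{s_{15}}\, Q_i + \sqrt{s_{16}}\, W_i, \\
p_i &= G_i + \sqrt{s_{17}}\, \Omega_i,
\end{aligned}
\tag{2}
$$

where $A_i, V_i, F_i, P_i, M_i, I_i, L_i, D_i, E_i, K_i, R_i, S_i, T_i, Y_i, H_i, C_i, N_i, Q_i, W_i, G_i, \Omega_i$ are the cumulative occurrence numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code), respectively, in the subsequence from the 1st base to the ith base in the sequence. The $s_k$, $k = 1, \ldots, 17$, are positive real numbers that are not perfect squares, with $s_i \neq s_j$ for $i \neq j$ ($i, j = 1, \ldots, 17$), and $m \neq \sqrt{n}$, $m \neq \sqrt{n s_k}$, $m\sqrt{s_k} \neq \sqrt{n}$, $k = 1, \ldots, 17$. We define $A_0 = V_0 = F_0 = P_0 = M_0 = I_0 = L_0 = D_0 = E_0 = K_0 = R_0 = S_0 = T_0 = Y_0 = H_0 = C_0 = N_0 = Q_0 = W_0 = G_0 = \Omega_0 = 0$.

Fig. 1. Pyrimidine-purine graph.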

We call the corresponding set of points the characteristic plot set. The curve connecting all points of the characteristic plot set in order is called the characteristic curve, which is determined by m and n satisfying the above-mentioned conditions. In Figs. 2-4, we show the curves corresponding to SARS for different parameters n and m, where s_1 = 2/3, s_2 = 3/4, s_3 = 4/5, s_4 = 5/6, s_5 = 6/7, s_6 = 7/8, s_7 = 8/9, s_8 = 9/10, s_9 = 10/11, s_10 = 11/12, s_11 = 12/13, s_12 = 13/14, s_13 = 14/15, s_14 = 15/16, s_15 = 16/17, s_16 = 17/18, s_17 = 18/19. Observing Figs. 2-4, we find that the SARS curves are similar despite the different parameters n and m.
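The construction of the characteristic curve can be sketched as below. The sketch follows Eqs. (1) and (2) as they are read here, so the polar residues accumulate in g_i and glycine plus the stop code accumulate in p_i; if the letter rule {H, C, P, G} is taken as authoritative instead, the two groups would be swapped. The defaults m = 4, n = 3 and s_k = (k + 1)/(k + 2) are the values quoted in the text, the stop code is written as `*`, and the function names are hypothetical.

```python
import math


def s_value(k):
    """s_k = (k + 1) / (k + 2), i.e. s_1 = 2/3, ..., s_17 = 18/19."""
    return (k + 1) / (k + 2)


# Residue order inside each weighted sum of Eq. (2); "*" is the stop code.
SUMS = (("h", "AVFPMIL"), ("c", "DEKR"), ("g", "STYHCNQW"), ("p", "G*"))


def residue_weights():
    """Map each residue to (sum name, weight); the first residue of each sum has weight 1."""
    weights, k = {}, 0
    for name, residues in SUMS:
        for position, res in enumerate(residues):
            if position == 0:
                weights[res] = (name, 1.0)
            else:
                k += 1
                weights[res] = (name, math.sqrt(s_value(k)))
    return weights


def characteristic_curve(protein, m=4.0, n=3.0):
    """Return the points P_0, P_1, ... of the curve for one protein sequence."""
    root_n = math.sqrt(n)
    # Direction attached to each weighted sum in Eq. (1).
    direction = {"h": (m, -root_n), "g": (root_n, -m), "c": (root_n, m), "p": (m, root_n)}
    weights = residue_weights()
    x = y = 0.0
    points = [(x, y)]
    for res in protein:
        name, weight = weights[res]
        dx, dy = direction[name]
        x += weight * dx
        y += weight * dy
        points.append((x, y))
    return points
```

Calling `characteristic_curve("AD")` gives the origin, then the step for A along (m, -sqrt(n)), then the step for D along (sqrt(n), m).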


Fig. 2. SARS corresponding curve with different parameters n and m based on the first reading frame.

Fig. 3. SARS corresponding curve with different parameters n and m based on the second reading frame.

Fig. 4. SARS corresponding curve with different parameters n and m based on the third reading frame.


Property 1. For a given DNA sequence there are three 2D representations corresponding to it.


Proof. Using the translate tool, one can obtain three protein sequences consisting of 20 amino acids and a stop code, corresponding to the three reading frames starting at positions 1, 2 and 3. In a single reading frame, let $(x_i, y_i)$ be the coordinates of the ith amino acid of the protein sequence; then we have

$$
h_i (m, -\sqrt{n}) + g_i (\sqrt{n}, -m) + c_i (\sqrt{n}, m) + p_i (m, \sqrt{n}) = (x_i, y_i),
$$

i.e.,

$$
\begin{aligned}
h_i m + g_i \sqrt{n} + c_i \sqrt{n} + p_i m &= x_i, \\
-h_i \sqrt{n} - g_i m + c_i m + p_i \sqrt{n} &= y_i.
\end{aligned}
\tag{3}
$$

□

Obviously, $x_i$ and $y_i$ are irrational numbers of the form $s m + k \sqrt{n}$, where $s$ and $k$ are integers. We suppose

$$
\begin{aligned}
x_i &= s_x m + k_x \sqrt{n}, \\
y_i &= s_y m + k_y \sqrt{n},
\end{aligned}
$$

then we have

$$
\begin{aligned}
h_i + p_i &= s_x, \\
g_i + c_i &= k_x, \\
-g_i + c_i &= s_y, \\
-h_i + p_i &= k_y.
\end{aligned}
\tag{4}
$$

So, for a given x-projection and y-projection of any point $P = (x, y)$ on the sequence, after uniquely determining $s_x, k_x, s_y, k_y$ from $x$ and $y$, the numbers $A_P, V_P, F_P, P_P, M_P, I_P, L_P, D_P, E_P, K_P, R_P, S_P, T_P, Y_P, H_P, C_P, N_P, Q_P, W_P, G_P, \Omega_P$ of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code) from the beginning of the sequence to the point $P$ can be found by solving the linear systems (2) and (4).

The vector pointing to the point $P_i$ from the origin $O$ is denoted by $r_i$. The components of $r_i$, i.e., $x_i$ and $y_i$, are calculated by Eqs. (1) and (2). Let $\Delta r_i = r_i - r_{i-1}$; then we have Property 2.


Property 2. For any $i = 1, 2, \ldots, N'$, where $N'$ is the length of the protein sequence corresponding to the studied DNA sequence, the vector $\Delta r_i$ has only twenty-one possible directions. Furthermore, the length of $\Delta r_i$, i.e., $|\Delta r_i|$, is always equal to $\sqrt{s_k (m^2 + n)}$, for any $i = 1, 2, \ldots, N'$, $k = 0, 1, \ldots, 17$, $s_0 = 1$.


Proof. Actually, the components of $\Delta r_i$, i.e., $\Delta x_i$ and $\Delta y_i$, can be calculated for each possible residue (A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and -) at the ith position of the protein sequence by using Eqs. (1) and (2). For example, when the ith residue is A, we find $\Delta x_i = m$ and $\Delta y_i = -\sqrt{n}$. This result is independent of the state of the (i - 1)th residue. The two numbers $(m, -\sqrt{n})$ are called the direction of $\Delta r_i$. The direction numbers and the length of $\Delta r_i$ for each possible residue type at the ith position are summarized in Table 2. □
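As a worked check of Property 2, assuming the parameter values quoted elsewhere in the text ($m = 4$, $n = 3$, $s_1 = 2/3$) and that V enters $h_i$ with weight $\sqrt{s_1}$ as in Eq. (2), the steps for residues A and V work out as:

```latex
\begin{aligned}
\text{A:}\quad & \Delta r_i = (m,\,-\sqrt{n}) = (4,\,-\sqrt{3}),
  & |\Delta r_i| &= \sqrt{m^2 + n} = \sqrt{19} \approx 4.36, \\
\text{V:}\quad & \Delta r_i = \sqrt{s_1}\,(m,\,-\sqrt{n}),
  & |\Delta r_i| &= \sqrt{s_1 (m^2 + n)} = \sqrt{38/3} \approx 3.56 .
\end{aligned}
```

Both steps point in the same direction and differ only in length, which is how residues within one chemical group stay distinguishable on the curve.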


Property 3. There is no circuit or degeneracy in our two- 
dimensional graphical representation. 


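Proof. We assume that: (1) the number of amino acids forming a circuit is $l$; (2) the numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code) in a circuit are $a', v', f', p', m', i', l', d', e', k', r', s', t', y', h', c', n', q', w', g'$ and $o'$, respectively, so that $a' + v' + f' + p' + m' + i' + l' + d' + e' + k' + r' + s' + t' + y' + h' + c' + n' + q' + w' + g' + o' = l$. Because $a'$A, $v'$V, $f'$F, $p'$P, $m'$M, $i'$I, $l'$L, $d'$D, $e'$E, $k'$K, $r'$R, $s'$S, $t'$T, $y'$Y, $h'$H, $c'$C, $n'$N, $q'$Q, $w'$W, $g'$G and $o'$ stop codes form a circuit, the following equations hold, where $\tilde{h}, \tilde{c}, \tilde{g}, \tilde{p}$ denote the weighted group sums over the circuit:

$$
\begin{aligned}
\tilde{h} &= a' + \sqrt{s_1}\, v' + \sqrt{s_2}\, f' + \sqrt{s_3}\, p' + \sqrt{s_4}\, m' + \sqrt{s_5}\, i' + \sqrt{s_6}\, l', \\
\tilde{c} &= d' + \sqrt{s_7}\, e' + \sqrt{s_8}\, k' + \sqrt{s_9}\, r', \\
\tilde{g} &= s' + \sqrt{s_{10}}\, t' + \sqrt{s_{11}}\, y' + \sqrt{s_{12}}\, h' + \sqrt{s_{13}}\, c' + \sqrt{s_{14}}\, n' + \sqrt{s_{15}}\, q' + \sqrt{s_{16}}\, w', \\
\tilde{p} &= g' + \sqrt{s_{17}}\, o',
\end{aligned}
\tag{5}
$$

$$
\tilde{h} (m, -\sqrt{n}) + \tilde{g} (\sqrt{n}, -m) + \tilde{c} (\sqrt{n}, m) + \tilde{p} (m, \sqrt{n}) = (0, 0),
$$

i.e.,

$$
\begin{aligned}
\tilde{h} m + \tilde{g} \sqrt{n} + \tilde{c} \sqrt{n} + \tilde{p} m &= 0, \\
-\tilde{h} \sqrt{n} - \tilde{g} m + \tilde{c} m + \tilde{p} \sqrt{n} &= 0.
\end{aligned}
\tag{6}
$$

Clearly Eqs. (5) and (6) hold if, and only if, $a' = v' = f' = p' = m' = i' = l' = d' = e' = k' = r' = s' = t' = y' = h' = c' = n' = q' = w' = g' = o' = 0$. Therefore, $l = 0$, which means no circuit exists in this graphical representation. □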


Property 4. The 2D representation possesses reflection symmetry.


Proof. Usually the sequence is expressed in the order from 5' to 3'. Suppose that the 2D representation for the protein sequence is described by $(x_i, y_i)$, $i = 0, 1, 2, \ldots, N$. Suppose again that the 2D representation for the reverse sequence, i.e., the same sequence but from 3' to 5', is described by $(\hat{x}_i, \hat{y}_i)$; we find

$$
\begin{aligned}
\hat{x}_i &= x_N - x_{N-i}, \\
\hat{y}_i &= y_N - y_{N-i}.
\end{aligned}
\tag{7}
$$

□
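The reflection relation can be checked numerically with a small sketch. It relies only on the fact that each residue contributes a fixed step, so reading the sequence backwards replays the same steps in reverse order; the step values in the usage note are arbitrary illustrations, and `cumulative_points` and `satisfies_reflection` are hypothetical names.

```python
import math


def cumulative_points(steps):
    """Return P_0 = (0, 0) followed by the running sums of the step vectors."""
    points = [(0.0, 0.0)]
    for dx, dy in steps:
        px, py = points[-1]
        points.append((px + dx, py + dy))
    return points


def satisfies_reflection(steps):
    """Check x^_i = x_N - x_(N-i) and y^_i = y_N - y_(N-i) for the reversed step order."""
    forward = cumulative_points(steps)
    backward = cumulative_points(steps[::-1])
    x_n, y_n = forward[-1]
    n = len(steps)
    return all(
        math.isclose(backward[i][0], x_n - forward[n - i][0], abs_tol=1e-12)
        and math.isclose(backward[i][1], y_n - forward[n - i][1], abs_tol=1e-12)
        for i in range(n + 1)
    )
```

For example, `satisfies_reflection([(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)])` evaluates the relation at every node of a three-step curve.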


3. Phylogenetic tree of coronaviruses 


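For any DNA sequence, we have three translated protein sequences. For any protein sequence, we have a set of points $(x_i, y_i)$, $i = 1, 2, 3, \ldots, N$, where $N$ is the length of the sequence. The coordinates of the geometrical center of the points, denoted by $x^0$ and $y^0$, may be calculated as follows:

$$
x^0 = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y^0 = \frac{1}{N} \sum_{i=1}^{N} y_i.
\tag{8}
$$

The elements of the covariance matrix CM of the points are defined as:

$$
\begin{aligned}
CM_{xx} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(x_i - x^0), \\
CM_{xy} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(y_i - y^0) = CM_{yx}, \\
CM_{yy} &= \frac{1}{N} \sum_{i=1}^{N} (y_i - y^0)(y_i - y^0).
\end{aligned}
\tag{9}
$$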


The above four numbers give a quantitative description of a set of points $(x_i, y_i)$, $i = 1, 2, \ldots, N$, scattered in a two-dimensional space. Obviously, the matrix is a real symmetric 2 × 2 one. There is a leading eigenvalue for a matrix CM, so there are three geometrical centers and three leading eigenvalues corresponding to a DNA sequence. In Table 3, we list the geometrical centers $(x_k^0, y_k^0)$, $k = 1, 2, 3$, and the leading eigenvalues belonging to the 24 species (see Table 1) with parameters m = 4, n = 3, s_1 = 2/3, s_2 = 3/4, s_3 = 4/5, s_4 = 5/6, s_5 = 6/7, s_6 = 7/8, s_7 = 8/9, s_8 = 9/10, s_9 = 10/11, s_10 = 11/12, s_11 = 12/13, s_12 = 13/14, s_13 = 14/15, s_14 = 15/16, s_15 = 16/17, s_16 = 17/18, s_17 = 18/19.
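The geometrical center, the covariance entries and the leading eigenvalue can be computed as in the sketch below. It uses the closed-form larger eigenvalue of a real symmetric 2 × 2 matrix instead of a numerical library; the function names are hypothetical, and the caller is assumed to pass the points of one reading frame.

```python
import math


def center_and_covariance(points):
    """Return ((x0, y0), (cm_xx, cm_xy, cm_yy)) following Eqs. (8) and (9)."""
    n = len(points)
    x0 = sum(x for x, _ in points) / n
    y0 = sum(y for _, y in points) / n
    cm_xx = sum((x - x0) ** 2 for x, _ in points) / n
    cm_xy = sum((x - x0) * (y - y0) for x, y in points) / n
    cm_yy = sum((y - y0) ** 2 for _, y in points) / n
    return (x0, y0), (cm_xx, cm_xy, cm_yy)


def leading_eigenvalue(cm_xx, cm_xy, cm_yy):
    """Larger eigenvalue of the symmetric matrix [[cm_xx, cm_xy], [cm_xy, cm_yy]]."""
    half_trace = (cm_xx + cm_yy) / 2
    return half_trace + math.sqrt(((cm_xx - cm_yy) / 2) ** 2 + cm_xy ** 2)
```

For three collinear points such as `[(0, 0), (1, 1), (2, 2)]` all of the variance lies along the diagonal, so the leading eigenvalue equals the trace of the matrix.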

In order to facilitate the quantitative comparison of different species in terms of their collective parameters, we introduce a distance scale as defined below. Suppose that there are two species i and j, whose parameters are $\lambda_1^i, \lambda_2^i, \lambda_3^i$ and $\lambda_1^j, \lambda_2^j, \lambda_3^j$, respectively, where $\lambda_1^i, \lambda_2^i, \lambda_3^i$ are the three leading eigenvalues of the matrices $CM^i$ corresponding to species i. The distance $d_{ij}$ between the two points is

[Eq. (10): expression for $d_{ij}$, not legible in the source]
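Since the printed expression for d_ij is not legible here, the following sketch assumes, purely for illustration, the Euclidean distance between the two triples of leading eigenvalues; the actual definition in Eq. (10) may differ (for example by a normalization). The function names are hypothetical.

```python
import math


def eigen_distance(lam_i, lam_j):
    """Assumed form: Euclidean distance between two eigenvalue triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lam_i, lam_j)))


def distance_matrix(eigen_triples):
    """Symmetric M x M matrix of pairwise distances for M species."""
    return [[eigen_distance(a, b) for b in eigen_triples] for a in eigen_triples]
```

The resulting matrix has zeros on the diagonal and is symmetric, which is what a distance-based clustering method expects as input.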


Table 1
The accession number, abbreviation, name and length for the 24 coronavirus genomes

No.  Accession  Abbreviation  Genome                                   Length (nt)
1    NC_002645  HCoV_229E     Human coronavirus 229E                   27317
2    NC_002306  TGEV          Transmissible gastroenteritis virus      28586
3    NC_003436  PEDV          Porcine epidemic diarrhea virus          28033
4    U00735     BCoVM         Bovine coronavirus strain Mebus          31032
5    AF391542   BCoVL         Bovine coronavirus isolate BCoV-LUN      31028
6    AF220295   BCoVQ         Bovine coronavirus Quebec                31100
7    NC_003045  BCoV          Bovine coronavirus                       31028
8    AF208067   MHVM          Murine hepatitis virus strain ML-10      31233
9    AF101929   MHV2          Murine hepatitis virus strain 2          31276
10   AF208066   MHVP          Murine hepatitis virus strain Penn 97-1  31112
11   NC_001846  MHV           Murine hepatitis virus                   31357
12   NC_001451  IBV           Avian infectious bronchitis virus        27608
13   AY278488   BJ01          SARS coronavirus BJ01                    29725
14   AY278741   Urbani        SARS coronavirus Urbani                  29727
15   AY278491   HKU-39849     SARS coronavirus HKU-39849               29742
16   AY278554   CUHK-W1       SARS coronavirus CUHK-W1                 29736
17   AY282752   CUHK-Su10     SARS coronavirus CUHK-Su10               29736
18   AY283794   SIN2500       SARS coronavirus Sin2500                 29711
19   AY283795   SIN2677       SARS coronavirus Sin2677                 29705
20   AY283796   SIN2679       SARS coronavirus Sin2679                 29711
21   AY283797   SIN2748       SARS coronavirus Sin2748                 29706
22   AY283798   SIN2774       SARS coronavirus Sin2774                 29711
23   AY291451   TW1           SARS coronavirus TW1                     29729
24   NC_004718  TOR2          SARS coronavirus                         29751


where $d_{ij}$ denotes the distance between the geometric centers of the ith and the jth genomes, and M is the total number of all genomes (M = 24, here). Then we obtain a real M × M symmetric matrix whose elements are $d_{ij}$. Accordingly, a real symmetric M × M matrix $D_{ij}$ is obtained and used to reflect the evolutionary distance between the species i and j. The clustering tree is constructed using the UPGMA method in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). The final phylogenetic tree is drawn using the DRAWGRAM program in the PHYLIP package. In Fig. 5, we present the phylogenetic tree belonging to the 24 species.


Table 2
Twenty-one possible directions
[Table body, listing the direction numbers of $\Delta r_i$ for each of the 21 residue types, is not recoverable from the source.]


Fig. 5. Phylogenetic tree.
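The clustering step can be illustrated with a minimal average-linkage (UPGMA) sketch. This is not the PHYLIP implementation: it only recovers the tree topology as nested tuples, without branch lengths, and the function name `upgma` and its input layout (a list of names plus a full symmetric distance matrix) are assumptions made for the example.

```python
def upgma(names, dist):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # The distance to the merged cluster is the size-weighted average.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

With three taxa where A and B are closest, `upgma(["A", "B", "C"], [[0, 1, 4], [1, 0, 4], [4, 4, 0]])` first joins A and B and then attaches C to that pair.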


Table 3
The geometric centers and three leading eigenvalues for each of the 24 coronavirus genomes

No.  x_1^0        y_1^0      x_2^0        y_2^0      x_3^0        y_3^0      λ_1     λ_2          λ_3
1    2.5692e+003  -159.0439  2.5566e+003  -342.5873  2.6794e+003  389.8249   2.1520  2.2371       2.3707
2    2.8619e+003  -230.4309  2.8245e+003  -723.2605  2.9971e+003  128.9913   2.6999  2.8393       2.9157
3    2.8626e+003  -233.0932  2.8231e+003  -724.5553  2.9976e+003  130.5104   2.7034  2.8386       2.9178
4    2.8602e+003  -245.6989  2.8245e+003  -743.4898  2.9985e+003  133.2708   2.7056  2.8453       2.9209
5    2.8688e+003  -294.6379  2.8364e+003  -709.6245  3.0012e+003  146.3851   2.1519  2.8561       2.9158
6    2.6263e+003  415.1362   2.5204e+003  -204.5027  2.4666e+003  -516.9428  2.2817  2.0813       2.1269
7    2.8773e+003  -476.9658  2.8773e+003  -476.9658  2.9006e+003  -252.7994  2.8910  2.8910       2.1932
8    2.8902e+003  446.8927   2.8902e+003  -446.8927  2.9139e+003  -221.4537  2.9004  2.9004       2.8179
9    2.8853e+003  459.6862   3.0344e+003  89.5446    2.8912e+003  -273.7115  2.9146  2.9139       2.1829
10   2.8582e+003  -528.7428  3.0320e+003  34.9426    2.8807e+003  -253.2886  2.8697  2.9882       2.7408
11   2.5137e+003  -415.8854  2.6893e+003  244.2464   2.5817e+003  -222.8666  2.2211  2.3287       2.1831
12   2.7670e+003  -48.3996   2.7276e+003  -34.7759   2.8570e+003  524.7574   2.4705  2.5740       2.6849
13   2.7255e+003  -35.7080   2.8550e+003  526.4976   2.7646e+003  -43.8066   2.5698  2.6804       2.4654
14   2.7656e+003  -45.9837   2.7262e+003  -35.1151   2.8557e+003  528.0186   2.4675  [illegible]  2.6821
15   2.7659e+003  -45.2775   2.7260e+003  -36.4889   2.8558e+003  530.0127   2.4680  2.5710       2.6828
16   2.7656e+003  -47.8004   2.7267e+003  -33.6628   2.8560e+003  527.4290   2.4680  2.5125       2.6838
17   2.7239e+003  -35.1426   2.8535e+003  527.3351   2.7632e+003  -45.2702   2.5669  2.6777       2.4630
18   2.7233e+003  36.1991    2.8529e+003  527.2583   2.7627e+003  -45.4289   2.5657  2.6766       2.4620
19   2.7239e+003  -34.4434   2.8535e+003  527.8162   2.7633e+003  -45.2775   2.5667  2.6780       2.4630
20   2.7239e+003  -35.6707   2.8525e+003  525.5247   2.7621e+003  -43.2715   2.5678  2.6737       2.4587
21   2.7241e+003  -35.5425   2.8535e+003  527.2287   2.7634e+003  -45.5734   2.5675  2.6777       2.4636
22   2.7647e+003  -48.0684   2.7258e+003  -35.7184   2.8553e+003  523.7099   2.4661  2.5700       2.6815
23   2.7647e+003  -47.8421   [illegible]  -35.8263   2.8547e+003  524.8910   2.4661  2.5692       2.6808
24   2.6110e+003  -251.1068  2.7585e+003  459.3175   2.0727e+003  -97.0235   2.3913  2.4587       2.3322


4. Conclusion


We made an analysis of DNA sequences by considering the fully overlapping triplets of nucleotide bases. The presented graphical representation can be recaptured mathematically without loss of textual information. Moreover, our representation provides a direct plotting method to denote DNA sequences without degeneracy.

Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of evolutionary model. The multiple alignment strategy does not work for all types of data, e.g., whole genome phylogeny, and the evolutionary models may not always be correct. The current two-dimensional graphical representation of DNA sequences provides a different approach for constructing phylogenetic trees. Unlike most existing phylogeny construction methods, the proposed method does not require multiple alignment. Also, both computational scientists and molecular biologists can use it to analyze protein sequences efficiently. We can obtain graphical representations of protein sequences based on 2D, 3D and 4D using the following transform: $a_i \to h_i$, $g_i \to g_i$, $c_i \to c_i$, $t_i \to p_i$, where $h_i$, $c_i$, $g_i$ and $p_i$ satisfy Eq. (2), and $a_i$, $c_i$, $g_i$ and $t_i$ are the cumulative occurrence numbers of A, C, G and T, respectively, in the subsequence from the 1st base to the ith base in the sequence.